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Abstract 



This paper studies sequential methods for recovery of sparse signals in high dimensions. When compared to fixed 
sample size procedures, in the sparse setting, sequential methods can result in a particularly large reduction in the 
number of samples needed for reliable signal support recovery. Starting with a lower bound, we show any sequential 
sampling procedure fails in the high dimensional limit provided the average number of measurements per dimension 
is less then D(Po\\Pi)~ logs, where s is the level of sparsity and D(Po\\Pi) the Kullback-Leibler divergence 
between the underlying distributions. An extension of the Sequential Probability Ratio Test (SPRT) which requires 
complete knowledge of the underlying distributions is shown to achieve this bound. We introduce a simple procedure 
termed Sequential Thresholding which can be implemented with limited knowledge of the underlying distributions, 
and guarantees exact support recovery provided the average number of measurements per dimension grows faster 
than D(Po\ log s, achieving the lower bound. For comparison, we show any non-sequential procedure fails 
provided the number of measurements grows at a rate less than D(Pi\\Pq)~ 1 Iogn, where n is the total dimension 
of the problem. 



Signal support recovery in high dimensions is a fundamental problem arising in many aspects of science and 
engineering. The goal of the basic problem is to determine, based on noisy observations, a sparse set of elements 
that somehow differ from the others. 

In this paper we study the following problem. Consider a support set S C {1, . . . , n} and a random variable Y% 
such that 



where Pq and P± are probability measures on y. The dimension of the problem, n, is large - perhaps thousands or 
millions or more - but the support set S is sparse in the sense that the number of elements following distribution 
Pi is much less than the dimension, i.e., |<S| = s -C n. The goal of the sparse recovery problem is to identify the 
set S from multiple independent realizations of the random variables Y\, Yi, . . . , Y n . 

This work was supported in part by AFOSR grant number FA9550-09-1-0140 and FA9550-09- 1-0643. Portions of this work were presented 
at the Asilomar Conference on Signals, Systems, and Computers, November 2011 (1), and the International Symposium on Information Theory 
in Saint Petersburg, August 2011 \2\. The authors would like to thank both Cun-Hui Zhang and Jarvis Haupt for helpful discussions. 



I. Introduction 




i = 1, . . . , n 



2 



The conventional theoretical treatment of this problem assumes that the samples are collected prior to data analysis 
in what is refereed to as a non-sequential (or fixed sample size) setting. In this case, m samples of each component 
are made (m samples of Y{ are gathered for each index i) and any test for inclusion in S is performed after the 
data is collected. The fundamental limits of reliable recovery are readily characterized in terms of Kullback-Leibler 
divergence and dimension (see Sec. IHI-Bl i. 

On the other hand, information gathering systems encountered in practice are often tasked with measuring some 
temporal signal or process, leaving the potential for the system to adapt the sampling approach based on prior 
observations. In this sequential setting, the decision to take an additional sample of any component i is based on 
prior realizations of that component. Herein lies the advantage of sequential methods: if prior samples indicate a 
particular component belongs (or doesn't belong) to S with sufficient certainty, measurement of that component can 
cease, and resources can be diverted to a more uncertain element. The focus of this paper is on the fundamental 
limits of recovery of such sequential systems. 

A. Main Contributions 

The results presented in this paper are in terms of asymptotic rate at which the average number of samples per 
dimension, denoted m, must increase with n to ensure exact recovery of S for any fixed distributions Pq and Pi. 
For a given procedure, the probability of correctly recovering the set S depends on the triple (n,s,m). As the 
dimension of the problems grows (as n — > oo), correctly recovering S becomes increasingly difficult, and m must 
also increase if we hope to recover S. One manner in which we can quantify the performance of a procedure is 
the rate at which m must grow as a function of n and s to ensure recovery of S. 

As such, the main contributions are 1 ) to derive a lower bound on the number of measurements required for success 
of any sequential procedure in the sparse setting, 2) introduce a simple sequential procedure termed Sequential 
Thresholding which can be implemented with limited knowledge of the distributions and show this simple procedure 
is asymptotically optimal, 3) compare this procedure to the known optimal SPRT which requires full knowledge 
of Po and Pi, and lastly 4) compare these results to the performance of any non-sequential procedure. Table J] 
summarizes these results. 

TABLE I 

Average number of measurements per dimension for exact support recovery in high dimensional limit 



any non-sequential 


log 71 

m - D(P 1 \\P ) 


necessary 




any sequential 


\ log s 
m - D(Poll-Pl) 


necessary 




SPRT based procedure 


m ~~ logs 


sufficient 


requires exact knowledge of 
Po, Pi 


Sequential Thresholding 


log s 
m > D(P \\P 1 ) 


sufficient 


can be implemented without 
exact knowledge of Pi 



These developments are intriguing primarily for two reasons. First, sequential procedures can succeed when 



3 



the number of measurements per dimension increases at a rate logarithmic in the level of sparsity, i.e. logs. In 
contrast, well known results from statistical testing show non-sequential procedures require the average number of 
measurements per dimension to increase at a rate logarithmic in the dimension, i.e. log n. Secondly, and perhaps 
equally as surprising, Sequential Thresholding, a simple, practical procedure introduced here, achieves optimal 
performance as the dimension grows large. 

B. Motivation 

The problem of sparse signal recovery using sequential measurements arises in a number of commonly encoun- 
tered problems in science and engineering. In communications, spectrum sensing for cognitive radio aims to identify 
unoccupied communication bands in the electromagnetic spectrum. Most bands will be occupied by primary users, 
but these users may come and go, leaving certain bands momentarily open and available for use by secondary 
transmitters. As the noisy samples of these occupied and un-occupied bands are collected in a temporal manner, 
sequential methods are a natural fit to map the occupation of the spectrum; in fact, recent work in spectrum sensing 
has given considerable attention to such approaches (see, for example Q, J4)). 

Another captivating example a of sparse recovery problem where sequential methods are currently employed is that 
of the Search for Extraterrestrial Intelligence (SETI) project. Researchers at the SETI institute sense for narrowband 
electromagnetic energy from distant star systems using large antenna arrays, with the hopes of finding extraterrestrial 
transmission. The dimension of the problem consists of over 100 billion stars in the Milky Way alone, each with 
9 million potential 'frequencies' in which to sense for narrow band energy. The subset of planetary systems with 
extraterrestrial transmission is sparse (as, of course, SETI is yet to make a contact). Moreover, while researchers 
may have a good idea of the distribution of the background noise, Pq, complete knowledge of Pi is of course 
not available, making procedures based on sequential probability ratio testing impossible to implement. Roughly 
speaking, researchers at SETI use a sequential procedure that repeatedly tests energy levels against a threshold 
up to five times 0, |6|. If any of the up to five measurements are below the threshold, the procedure passes to 
the next frequency/star. Should the measurements exceed the threshold on all five occasions, measurements of that 
star and frequency are passed to an operator for further inspection. This procedure is closely related to Sequential 
Thresholding. Sequential Thresholding results in substantial gains over fixed sample size procedures and, unlike 
the SPRT, it can be computed without full knowledge of Pi. 

Sparse recovery also underlies a number of recent micro-array studies in biology. Here, biologists attempt to 
estimate a sparse set of genes or proteins that are critically involved in a certain process or function (such as 
virus replication). The biologists may have good estimates of the null distribution, Pq, but not of the alternative 
distribution, Pi, again making procedures based on the SPRT impractical. A number of recent publications have 
implemented various multi-stage (thus sequential) procedures [7|-[10| that operate without full knowledge of Pi. 
The proposed procedures in general aim to reduce the total dimension of the problem and then employ traditional 
recovery techniques. While a number of authors suspect such sequential methods result in increased sensitivity, the 
gains are not theoretically quantified. 
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C. Related Work 

Many of the fundamental results in sequential analysis were developed by Wald, and formalized in his book, 
Sequential Analysis ifTTl , Most relevant to the developments here, the sequential probability ratio test (SPRT) was 
shown to be optimal in terms of minimizing the error probabilities and expected number of measurements for a 
simple binary hypothesis test. A number of issues arise, including loss of optimality, when exact knowledge of the 
distributions is unavailable. In this scenario, a number of procedures have been proposed for composite alternatives, 
including those by Wald Ifl2ll . which usually incorporate a weighting over parametric families, or generalized 
likelihood ratio testing 11131 . 

Sequential testing for sparse signals was perhaps first studied by Posner in fl4l . Motivated by the the problem of 
finding a lost satellite in the sky, Posner aimed to minimize the expected search time using a multistage procedure. 
Posner's procedure is closely related to the high dimensional extension of the sequential probability ratio test (see 
Sec. |IV] for details). Sequential approaches to the high dimensional sparse recovery problem have recently been 
given increased attention, perhaps motivated by the success of exploiting sparsity in other areas. Related work 
includes ifTSl . fl6l . in which the authors extend the work of lfl4ll to include multiple targets, encompassing a more 
general model. 

In some of the first work to theoretically quantify the gains of sequential methods for high dimensional recovery 
IfTTl . 08], the authors proposed a sequential procedure for recovery in additive Gaussian noise, termed Distilled 
Sensing. Our Sequential Thresholding approach is similar to the so-called Distilled Sensing method, however there 
are two main distinctions. First, the results in this paper are applicable to a larger class of problems characterized 
by finite Kullback-Leibler divergence; the Distilled Sensing approach is specific to the Gaussian setting. Second, 
here we are concerned with the probability of error in exact recover of S; Distilled Sensing controls the false 
discovery and non-discovery rates which is less demanding than the error rate control. Also closely related to the 
lower bounds developed here are those of |19j, which are in terms of the expected set difference, are restricted to 
the Gaussian setting, and published after the initial work in (TJ, Q. 

Another related set of problems is that of finding the best arm in a muli-armed bandit game ||20| . ETTl . Some 
approaches to these problems are similar in nature to Sequential Thresholding, namely the median elimination 
procedure of l22ll . but the problem of finding the best arm is fundamentally different than recovering S. Another 
difference is these procedures in general assume no knowledge of the distributions Pq and Pi, resulting in order 
optimal procedures at best. 

D. Organization 

The remainder of the paper is organized as follows. In Sec. [TT] we formalize the problem. Sec. IIII-AI derives the 
necessary condition on the number of samples required for exact recovery using any procedure. For comparison, 
Sec. IIII-BI derives a neccessary condition on the average number of measurements for non-sequential procedures. 
Next, Sec. |IV] analyzes the SPRT in the sparse setting and discusses some of the shortcomings of the test when exact 
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knowledge of the distributions is not available. Lastly, Sec. [VI introduces Sequential Thresholding and analyzes its 
performance. 

II. Problem Formulation 

Let iS be a subset of {1, ... , n} with cardinality s = \S\. For any index i G {1, . . . , n}, the random variable Y^j 
is independent, identically distributed according to 

{ Po i$S 

Y id ~l i = l,...,n (1) 

[ Pi ieS 

where Po an d Pi are probability measures with joint support on y. In words, the random variable Yij follows 
distribution Pl(-) if i belongs to S, and follows Po(-) otherwise, and j indexes multiple i.i.d. samples of any 
component i. We refer to Po as the null distribution, and Pi the alternative. 

Our analysis is concerned with exact recovery of the set S. Let S be an estimate of S. The family wise error 
rate is defined as: 

P e = P(«S ^ S) = P I |J Si U (J Si I (2) 

\i<?S ies J 

where Si is the event that an error is made at index i. 

In the lower bounds developed in this paper our consideration is limited to component wise procedures that test 
each index in an identical manner. This assumption implies that the individual error rates at each index are the 
same; specifically, P(£ 4 ) = P(£V) for all e S, and, likewise P(£ 4 ) = F(£ t ,) for all # S . Under this 
assumption we can simply notation and define the false positive and false negative rates: 

a = ¥ (S) /8 = Pi(£) 

where Po(£) = P(fi|i ^ S) and Pi(£ ) = P(£j|i S S). The component wise assumption implies the procedure only 
uses samples of component i to make inference about that particular component. More specifically, the decision to 
re-measure a particular component or include it in the estimate of S depends only on samples of that component. As 
the dimension of the problem grows large (which is our regime of interest), there is no loss of optimality associated 
with this restriction [19]. 

The log-likelihood ratio statistic comprised of multiple i.i.d. samples of a particular index is defined as: 

L^.-.^-Dog^. (3) 

Here, the superscript I explicitly indicates the number of samples used to form the likelihood ratio and is suppressed 
when unambiguous. The Kullback-Leibler divergence from distribution P\ to Pq is defined as: 

Pi(Y)' 



D(P 1 \\P )=Ei 



1Og P (Y) 
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-L 



(t) a.s 



where Ei [•] is expectation with respect to distribution Pi, which gives the usual convergence of the normalized 
likelihood ratio as t grows large: 

-D(P Q \\Pi) i^S 

(4) 

D(Pi||Po) i€S. 

It is sometimes convenient to state results in terms of the maximum of D(P \\Pi) and D(Pi\\P ). In this case, we 
define 

D KL = msx{D{P \\P 1 ),D{P 1 \\P )}. 

In order to bound rates of convergence of particular testing procedures, we make use of the variance of the likelihood 
ratio, denoted 



^(Pill^o) = var e s) = E 1 (log J^j - £>(Pi||P )) 



A sampling procedure T is a method used to determine the number of samples taken of each index. To be precise 
in characterizing a sampling procedure, we present three definitions. 

Definition 1. Sampling procedure. A binary valued function T on 6 {l,...,n}xN that defines the number 
of samples of Yi that are observed. Specifically, if Tij = 1, then Yij is observed, and can be used in estimation 
of S. Conversely, ifT^j = 0, then is not observed, and is not used in estimation of S. 

Definition 2. Non-sequential (fixed sample size) sampling procedure. Any sampling procedure T such that T^j is 
not a function ofYi^ji for any 



Definition 3. Sequential sampling procedure. A sampling procedure T in which Tij is allowed to depend on 
previous samples, specifically, Tij : {Y^i, . . . , Y^j-i} i->- {0, 1}. 

Sequential procedures can make use of information as it becomes available to adjust the sample size, while 
non-sequential procedures, or fixed sample size procedures, fix the number of samples taken a priori. Note that 
under this definition, the set of non-sequential procedures are a subset of sequential procedures. 

In order to make fair comparison between different procedures, we limit the total number of samples in expec- 
tation. For any procedure V we require 



E 



'■3 



< nm 



(5) 



for some m > 0. This simply implies, on average, the procedure uses m or fewer samples per dimension. 

The family wise error rate of any procedure used to estimate S depends on the underlying distributions P and 
Pi, the dimension, n, the level of sparsity s, and the average number of samples per component, m. Throughout, s 
and m are non-decreasing functions in n (and thus, the set S is also a function of n). We suppress this dependence 
on n for ease of exposition. Our focus will be on finding the relationship between the triple (n, s, m) such that 
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for any fixed distributions Pq and Pi, either lirrin-^oo P e = (the procedure is reliable) or lim„_ i . 00 P e > (the 
procedure is unreliable). Without loss of generality, we assume s < n/2 (since, of course, if s > n/2, one can 
re-label the problem, swapping Po and Pi). As we are interested in sparse problems, a few of the results require the 
assumption that s -c n. We often make the assumption that lim„_ i . 00 — = 0, which is termed sub-linear sparsity, 
but this scaling is stated explicitly when needed. 

III. Limits of Reliable Recovery 

This section presents lower bounds on the number of measurements required for reliable recovery by any procedure 
in both the sequential and non-sequential setting. The bounds are in terms of the expected number of samples per 
dimension. 

A. Limitation of Sequential Procedures 

The following theorem quantifies the limitations of any procedure, which includes both sequential and non- 
sequential procedures, as non-sequential procedures are a subset of sequential procedures (from Def. [2] and Def. |3). 
The bound applies to finite problems, but also implies a necessary rate of growth as n becomes large, captured in 
the ensuing corollary. 

Theorem 4. Finite sample limitations of sequential procedures. Any (sequential) procedure with 

< log s + log (jg) 
Dkl 

also has 

P e > 1 - e~ 5 w S 

where the approximation holds for small S. 

Proof: See Appendix lAl ■ 
Thm. |4] establishes a lower bound on the expected number of samples needed to achieve a particular family wise 
error rate. As the dimension of the problem grows, it provides us with a necessary condition for reliable recovery. 

Corollary 5. Limitations of sequential procedures. Assume linin^oo s/n = 0. Any (sequential) procedure with 

to 1 
lim < 



logs " £>(Po||Pi) 

> 1/5. 

Proof: Thm. |4] implies that if to < jy^- then P e > 1 — e -1 / 4 > 1/5. Dividing by logs and taking the limit as 
n — > oo would give the lemma if Dkl = D(Pq\\Pi). Instead, returning to the proof of Thm. |U it is easily verified 
that if limn-i.00 s/n — 0, the analysis follows with Dkl replaced by D(Pq\\Pi). ■ 
In words, if the number of samples per dimension grows at a rate slower than logarithmically in the level of 
sparsity, no procedure can reliably recover S. In shorthand notation, if to < wjiSfe^ then P e can not be driven 
to zero, and recovery of S is unreliable in the large n limit. 
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B. Limitation of Non-Sequential Procedures 

Non-sequential methods, which sample each index a fixed number of times, can require significantly more 
measurements than sequential procedures. In the following theorem we state a necessary condition on m for any 
reliable non-sequential procedure. The proof is based on analysis of the Chernoff 'Information [23 1. Our consideration 
is restricted to non-sequential procedures which sample each component i = 1, . . . , n exactly m times: 

1 j < m 

j > m. 

Theorem 6. Limitation of non-sequential procedures. Assume lirrin^oos/n = 0. Any non-sequential procedure 
with 

km < — ; — n — r . (6) 

n->oo logn D(Px\\Pq) 

also has 

lim P e > 0. 

n— ^oo 

Proof: See Appendix [B] ■ 

IV. Sequential Probability Ratio Testing 

A. The SPRT 

Provided Pq and Pi are known, sequential probability ratio tests are optimal for binary hypothesis tests in terms 
of minimizing the expected number of measurements for any error probabilities a and f3 (shown originally in ll24lD ; 
this optimality translates to the high dimensional case by simply considering n parallel SPRTs. 

Each individual SPRT operates by continuing to measure a component if the corresponding likelihood ratio is 
within an upper and lower threshold, and terminating measurement otherwise. For scalar thresholds 7l and 7u, the 
procedure is defined as 

I else 

where Y[j=i b}(y ' j -) ^ s t ' le likelihood ratio comprised of all prior samples. If the likelihood ratio falls below 7l, the 
SPRT labels index i as not belonging to S; if the likelihood ratio exceeds 7u, index i is assigned to S. Equivalently, 
the test can be implement in the log-likelihood domain, and L^f ' can be compared against log(TL) and log(Tu). 
The procedure requires a random number of samples of each component, denoted Ji, and defined as 

Ji ■= min{j : T iJ+ i = 0}. 

As we proceed we make a minor assumption on the distribution of the log-likelihood statistic. Specifically, the 
ensuing theorem and proof require existence of positive constants C\ and C-2 such that 

E [ £ W| £ W < log7L ] > log7L _ Cl E[4 Jt) \L^ > log 7u ] < log 7u + C 2 (8) 
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for all thresholds 7u and 7l- In some cases, bounds for C\ and C2 are known (see 1121 . p. 145, where explicit 
expressions for the Bernoulli and Gaussian case are given). In words, the requirement is the existence of a constant 
that bounds the expected value of the log-likelihood ratio when the procedure terminates, regardless of the value 
of the threshold. This condition is satisfied when l\ follows any bounded distribution, Gaussian distributions, 
exponential distributions, among others. It is not satisfied by distributions with infinite variance or polynomial tails. 
A more thorough discussion of this restriction is studied in 11251 . 

Theorem 7. Ability of the SPRT. The SPRT procedure with thresholds 7l = jtf; and 7u = (n — s) 1+e , any e > 0, 
has 

lim P e = 

and 

lim < 



logs " D(Poll-Pi)" 
provided s < n/ logn, and the condition in @ is satisfied. 

Proof: See Appendix [C] ■ 
B. Implementation Issues 

Implementing an SPRT on each component can be challenging for many problems encountered in practice. While 
the SPRT is optimal when both Pq and Pi are known and testing a single component amounts to a simple binary 
hypothesis test, scenarios often arise where some parameter of distribution Pi is unknown. When some parameter 
of Pi is unknown, the likelihood ratio cannot be formed, and sufficient statistics for the likelihood ratio result in 
adjustments to the thresholds based on the unknown parameters of distribution Pi. With incorrect thresholds, the 
SPRT is no longer optimal. To see this more concretely, consider a problem where Pi is a normal distribution 
with an unknown positive mean p, and Pq is a zero mean standard normal distribution. Here, the SPRT procedure 
continues to sample a particular index if 



<7E^<^ + f, (9) 



equivalent to (0. While the statistic Ylj=i ^ ,3 c ' oes not depend on the unknown parameter fi, the thresholds do. If 
the test is implemented with an incorrect value of p, it may continue to sample an index ad infinitum (the so-called 
open continuation region |26|). This occurs when /i is overestimated; consider a scenario in which the threshold is 
set using /x = lOp, where /i is the true mean of P x . If Pi is the true distribution, we have jj Yjj=i ~ ^(A 4 ) Vj')' 
which, with high probability, will not exceed the upper threshold (which is greater than 5[i). 

V. Sequential Thresholding 

Sequential Thresholding is based on simple idea: repeatedly reduce the dimension of the problem by sequentially 
eliminating elements that exhibit strong evidence they don't belong to S. Sequential Thresholding consists of a 
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series of K measurement steps, where each step eliminates from consideration a proportion of the components 
measured on the prior step. After the last step, the procedure terminates, and the remaining components are taken 
as the estimate of S. 

To illustrate the main idea behind the procedure, we first introduce a simplified version of Sequential Thresholding 
and analyze the simplified procedure for a specific problem. This simple Sequential Thresholding, while not achieving 
asymptotic optimality, does admit a simple error analysis. The more general version of Sequential Thresholding, 
which does achieve optimality and the lower bound of Cor. [5] is presented in the second half of this section. 

A. Example of Simple Sequential Thresholding 

To highlight the main idea behind Sequential Thresholding, and the potential performance gains, consider a 
problem where Po ~ Af(0, 1) and Pi ~ Af(8, 1) for some 6 > 0. The simple Sequential Thresholding procedure 
requires two inputs - 1) the number of steps, K k, logn, and 2) an even integer to > 2 that defines the average 
number of samples per index, and hence the total budget. On the first step the procedure samples all indices 
to/2 times each, for all i, requiring ran/ 2 samples. These to/2 samples are summed for each index. If this sum, 
Y^jZi Yi, is less than zero, that particular index is not sampled on subsequent passes. This eliminates a proportion 
(approximately half) of components following the null distribution (since the median of 53j=i f° r * ^ 5 is 
zero). Indices that exceed the threshold, i.e. {i : Y^jZi > 0}> aie sampled on the subsequent step. This process 
continues for K ss log 2 n steps. After the Kth step, the procedure terminates, and estimates S as the set of indices 
that have not been eliminated from consideration. Roughly speaking, provided s <C n, the procedure reduces the 
number of samples taken on each step by half as most components follow the null, which is zero mean. The total 
number of samples required by the procedure on all passes is ran/ 2 + ran/ 4 + mn/8 + ■ ■ ■ ~ mn. This implies, 
on average, m or fewer samples per dimension. 

Algorithm 1 Simple Implementation of Sequential Thresholding 
input: K « log n steps, budget m > 2 

initialize: Si = {1, . . . , n} 

for k = 1, . . . ,K do 



for i e Sk do 




threshold: <S fe+1 := {i e S k : YJ'il > 0} 



end for 



end for 



output: Sk+i 



Corollary 8. Ability of simple Sequential Thresholding. Let K = 



+ e) log 2 n\ for any e > and consider 
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the setting above where Pq ~ 7V(0, 1) and Pi ~ J\f(6, 1). The simple Sequential Thresholding algorithm satisfies 
liniyi^o P e = provided 

logs + loglog 2 n 



m > 



6*2/8 



Proof: From a union bound, 



<(n-s)a + s/3. (10) 



The false positive event occurs when, for i £ S, the index survives all K thresholding steps. By the independence 
across steps, and since the median of Y^=i ^ f° r * G 5 is zero, 

l\ K /i\ (1+e) 



The false negative event occurs when for some i £ <S, the sum J^jlli ^ f a ^ s t> e l° w zero on an y step- Applying a 
union bound and Gaussian tail bound, since 53j=i ~ Af (m6/2, m/2), we have 

/3 < ycxp (-m6> 2 /4) 

< (1 + e) log 2 (w) exp 

< exp(-m6» 2 /4 + log((l + e)log 2 n)). (12) 

Combining ( [Tol l. ( fTTT ) and ( IT2l > gives 

Imposing the condition in the theorem, and taking the limit as n — > oo gives the desired result. ■ 
While the sub-optimal simple version of Sequential Thresholding does not achieve the lower bound, it does out- 
perform non-sequential procedures. The procedure requires the average number of samples per dimension, m, to be 
order log s + log log n for successful recovery. On the other hand, Sec. IIII-BI shows non-sequential methods require 
m on the order of logrt samples. For large n and small s, logn can be significantly larger than logs + log log n, 
implying that Sequential Thresholding, for sufficiently sparse problems, will succeed with fewer samples than any 
non-sequential procedure. 



B. Details of Sequential Thresholding 

While the previous discussion highlighted the main principle behind Sequential Thresholding, the procedure 
becomes slightly more complicated in its full generality. To the show procedure achieves the lower bound of Cor. [5] 
as n grows large, both the allocation of measurements across steps and the proportion of null components discarded 
on each step must be adjusted. 

In general, Sequential Thresholding requires three inputs: 1) K, the number of steps, 2) a constant p G [1/2, 1) 
representing the proportion of null components discarded on each step, and 3) a total measurement budget mn 
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(implying m samples per dimension). The expected proportion of null components discarded on each step, p, is 
fixed throughout the procedure and defined as 



(13) 



Here, uik is the number of samples used on any particular index measured on step k. As nik is a function of the 
step index, so is the threshold jk- 

With p and K as inputs, and a total expected measurement budget ran, Sequential Thresholding operates as 
follows. Let St denote the subset of {1, . . . , n} comprised of components still under consideration at step fc. The 
procedure first initializes by setting Si = {1, . . . , n}. For steps k = 1, . . . , K, the procedure proceeds as follows. 
On step fc, each component in S k is sampled mk times. The number of samples taken on step k is defined as 

n 



mk p 



(14) 



v n + sK 2 1 

The procedure then compares the likelihood ratio comprised of the rau samples to the threshold defined in dT3l > and 
includes only the indices that exceed the threshold in the set of components to be sampled on the following step: 



S 



k+l 



where j k is defined in ( fT3l) . In words, if L ™' is below j k , no further measurements of component i are taken 
for the remainder of the procedure. Otherwise, component i is measured on the subsequent step. By definition of 
7fc, approximately p times the number of remaining components following Pq will be eliminated on each step; if 
s n, each thresholding step eliminates approximately p times the total number of components remaining. After 
step K, the procedure terminates and estimates S as the indices still under consideration: S = Sk+i- The procedure 
is detailed in Alg. [2] 



Algorithm 2 Sequential Thresholding 



input: K 



log. 



2(n-s) 



steps, p £ [1/2, 1), maximum budget m 



initialize: Si = {1, . . . , n} 
for k = 1, . . . ,K do 
for i e Sk do 

measure: {Ygfe r 



threshold: Sk+i 
end for 
end for 

output: Sk+i 



where mk is given in (fl4l > 



Lj = l -L \ I. J 
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C. Measurement Budget 

By design, Sequential Thresholding satisfies the measurement budget in (0. Consider the expected number of 
samples required by Sequential Thresholding: 



E 



h3 



|J St IE 



i£5 



(15) 





E 


E r « 




\ies / 






ies 



where the equality follows from the law of total probability and conditioning on one or more false negative events. 
From the description of the procedure, one or more false negatives can only reduce the total number of samples, 
and we have 



E 



E r « 




> E 


E r - 


U* 




ies 






ies 



Combining this with < TT3T > gives 



E 


E r - 


< E 


E r -- 








K 




ies 



fe=i 

K 



< 



fe=l 

< mn 

< mn. 



E-(^^) V((l-p) fe - 1 (n-.s) + .s) 



n + s-ftT 2 

n — s + sK 2 
n + sK 2 



Here, the equality follows from independence of the samples across the K steps. The third inequality follows as 
the sum is a geometric series for p 6 [1/2, 1) and as Y^=i & — -^ 2 - 

Z). Ability of Sequential Thresholding 

For fixed Po and Pi, the following theorem and corollary relate (n, s, m) to the family wise error rate of the 
procedure. The corollary follows from the more general Thm. [10] 



Corollary 9. Ability of Sequential Thresholding. If 



m 

lim ; > 



1 



»logs P>(Po||Pi) 



then sequential threshold satisfies 



lim P e = 



14 



with input parameters K = |Tog(2(n — s))/ (21oglogs) — 1/2] and p=l — ^ provided s < n/(logn) 2 and 



linir. 



oo. 



Proof: The proof follows from Thm. QJJby setting 6 = jj^j, which implies linin^oo P e = 0. By setting K 
and p as defined in the statement of the theorem, lim„_ s . 00 c„ = D(Pq\\Pi), where c n is defined in ( TToT l. Together 
with the forward part of the theorem, this implies the corollary. ■ 
Comparison of Cor. [9] to Cor. [5] shows that Sequential Thresholding is asymptotically optimal in terms of 
the required number of samples needed for reliable recovery. We continue with the main theorem of Sequential 
Thresholding, which quantifies the expected number of samples per dimension in the finite setting. The theorem is 
in terms of a sequence, c n , which, under certain conditions, approaches the Kullback-Leibler divergence between 
Pq and Pi. Proof of the theorem relies on techniques closely related to the Chernoff-Stein Lemma, and is found 
in the Appendix. 

Theorem 10. Finite sample performance of Sequential Thresholding. Consider Sequential Thresholding with K = 



steps and measurement allocation in M4\ . Provided 
log s + log <5 _1 + log 4 



m > 



then 



where 



< S 



sK 2 



D(Po\\Pi) 



\ 



° 2 (Po\\Pi) 



p 2 n log s 



D(P \\P 1 )(n+sK2) 



-1 



(16) 



and is assumed to be positive. 



Proof: See Appendix ID] ■ 
Thm. [10] and Cor. [9] imply that as the size of the problem increases (as n goes to infinity), if m is greater than 
Z) (Pol I -Pi) 1 logs, the procedure will succeed in exact recovery of the sparse support set. This achieves the lower 
bound in Cor. [5] which states that any reliable procedure requires at least D(Pq | log s samples per dimension. 

This implies that Sequential Thresholding is in a sense first order optimal. While not discussed here, one could 
also analyze the rate at which the procedure approaches the lower bound in Cor. [5] although the authors suspect 
the procedure would not achieve second order optimality. 



E. Implementation and Practical Concerns 

One of the main attributes of Sequential Thresholding is that implementation does not require exact knowledge 
of distribution Pi. While apparent in the example of simple Sequential Thresholding at the beginning of the section, 
in which two normal distributions are compared, this appealing property also extends to other settings. To be more 
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specific, consider a sparse recovery problem in which Pq is known (or well approximated), but Pi is defined by a 
parametric family of distributions with an unknown parameter 9. Clearly the likelihood ratio, 



T which does not depend on the unknown parameter 9, then an equivalent test based on T can be written; ( fT3b 
can be written in terms of T. As the test statistic and the thresholds, jk, depend only on Pq, the procedure can be 
implemented without knowledge of 9. 

In practice this occurs for many distributions in the exponential family as the log-likelihood ratio, Li, defined in 
(01, is a monotonic function of a test statistic T that does not depend on parameters of Pi. This property is well 
illustrated by the example of testing two Gaussian distributions discussed in the beginning of the section. If we 
assume the null distribution is known, but the offset mean of Pi is unknown, the procedure can still be implemented. 
The sum of the measurements, J2j Yi,j> ^ s a sufficient statistic whose distribution under Pq does not depend on P\. 
This implies that 7& does not depend on P\, and thus, the procedure can be implemented without this knowledge. 

There are two possible implementations of Sequential Thresholding which we refer to as parallel and scanning. 
The parallel implementation samples and tests all n components in parallel according to the procedure. The scanning 
implementation measures and tests the n components in a sequence (which can be arbitrary). For example, the 
scanning implementation can begin with component i = 1 and repeatedly measure and threshold the observations 
up to K times. If an observation falls below the threshold at any point, then the scanning procedure immediately 
moves on to the next component. If K observations are made without an observation falling below the threshold, 
then the component is added to the set Sk+i- 

The two implementations are equivalent from a theoretical perspective. The parallel implementation may be more 
natural for large-scale experimental designs (e.g., in the biological sciences), whereas the scanning implementation 
is more appropriate in communications applications such as spectrum sensing. The latter also reveals natural 
connections between Sequential Thresholding and the SPRT. 



This paper showed that sequential methods for support recovery of high dimensional sparse signals in noise 
can succeed using far fewer measurements than non-sequential methods. More specifically, non-sequential methods 
require the number of measurements to grow logarithmically with the dimension, while sequential methods succeed if 
the number of measurements grows logarithmically with the level of sparsity. A simple procedure termed Sequential 
Thresholding was shown to achieve the lower bound asymptotically. Sequential Thresholding can be implemented 
without full knowledge of the underlying distributions, making it more practical for sparse recovery encountered 
in real world problems. 




cannot be formed as 9 is unknown. None-the-less, if the likelihood ratio is a monotonic function of a test statistic 



VI. Conclusion 
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Appendix A 
Proof of TheoremO 

Proof: We first bound the family wise error rate in terms of the false positive and false negative probabilities 
associated with incorrectly assigning or excluding any element from S. From (g), 

|J £ i u (J Si 

v i0S i£S 

= i-p(f|^nf|^ 

\i£S i£S 

By the assumption of a coordinate wise test, 

P e = 1 - (1 -/3) s (l -a) n - s 

> 1 - e -Ps e -<*(™-s) (17) 

where the last inequality follows as (1 — (3) n < e~ /3n for (3 G [0, 1] and n G {1, . . . }. To continue, we can bound 
the expected number of samples associated with any particular index. From |26|, Thm. 2.39, the following holds 
for any binary hypothesis test 

alog( T ^)+(l-a)log(i^) 

mo - mm 

where mo is the expected number of samples of any component i £ S. We can further bound the expected number 
of samples as 

a log a + (1 — a) log(l — a) + (1 — a) log /3 _1 



m > 



> 



D(Po\\Pi) 
(1 - a) log/?" 1 - log 2 



D(Po\\Pi) 

where the first inequality follows as alog(l/(l — /?)) > 0, and the last inequality follows as aloga + (l — a) log(l- 
a) > log(l/2), all a <E [0, 1]. Likewise 

(l-/3)log(i^) +/31og( T Aj^ 



mi > 



> 



D(Pi\\Po) 
(1 -^loga" 1 - log 2 



D(Pi\\Po) 

where mi is the average number of samples given i G S, and the first inequality is again from Thm. 2.39 of l26l . 

Let D KL = max^Poll^U^OPill^o)}- We have 

(n — s)mo + s mi 

m = 

n 

^ in — s)(l — a) log/3 -1 + s(l — (3) logo; -1 — nlog2 

nD Kh 
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If ot < j3 we have 



m > 



(n - s)(l - /?) log + s(l - /3) log /3" 1 - n log 2 



u5 



KL 



> 



(l-/3)logr 1 -log2 
log (^) - log 2 



KL 



where the last inequality is easily verified for f3 G [0, 1]. 

Imposing the condition in the forward part of the theorem, m < (logs + log(45) _1 )/Z?KL gives 



logs + log (£) log (i)- log 2 



> m > 



KL 



KL 



which implies 



and thus 



- log ( ^) > log ^ ) - l<»g 2 



P > -■ 

s 



Hence, 



> 1 - e - s e- (n - s)c 



> 1 -e" 



Conversely if /3 > q 



and 



which, provided s < n/2, gives 



m > 



log (2b) -tog 2 

£>KL 



> 1 - e~ sp e 



> 1 -e" 



completing the proof. 

Appendix B 
Proof of Theorem[6] 
Proof: We write the family wise error rate as: 

U £ i u (J ^ 

ieS 

l-(l-/3) s (l-a)' 



(18) 



> 1 - e -<*(n-s) e -Ps 



which cannot be driven to zero if 



a > 



Equivalently, if 



lim — log a 1 < lim — log(n — s) 

m— >oc 777, m— >oo 777 



From 1231 . p. 386, (Chernoff Information), 



1 



where 



for A 6 [0,1]. Thus, if 



lim -loga- 1 = D(P X \\P ) < DfrWPo) 

m— >oo 777 



In P oPi~ X dy 



lim — log(ra - s) > D(P 1 \\P ) 

m— foo 777 

limn-j.oo P e > 0. Since lim n _ > . 00 m = oo and linin^oo n — s — n, this implies the result. If 

777 1 

lim < 



log 77 D(Pi||P ) 



then limn-^oo F e > 0. 



Appendix C 
Proof of Theorem[7] 

Proof: For an SPRT with thresholds 7l and 70, from ll26l . the following well known inequalities hold 



a - ^ = (77 - V +£ 



1 



< 7L = — 

s 1 



From a union bound on the family-wise error rate 

lim P e < lim (77 — s)a + s/3 = 



n— y-oo n—too 



implying the forward portion of the lemma. 

We can write the expected number of measurements per dimension as 

(n - s)E [Ji] + s Ei [Ji_ 



By Wald's identity 



Ei \JA 



Ex 



L 



(Ji) 



Ei 
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and similarly, Eq [Ji 



D(Pa\\Pi) 

m 



lim 

n->oo log s 



. Dividing by log s and taking the limit, we have 

(n - s)Eo [Ji] + s Ei [J»] 



lim ■ 

n— f oo 



nlogs 



-(n-s)Eo 
i^o nlog(s)P>(P 1 1 Pi ) 



(Ji) 



lim 



sEi 



< lim (—^fV^^lim 

n->-oo nlog(s)D(Po\\Pi) «-><> 

where the inequality follows by the assumptions in ([8} and as 



nlog(a)£>(Pi||P ) 
s(log7u + C 2 ) 



■,log(a)£>(P ||Pi) 



-E 



= - (l-a)E 



L 



(J.) 



L,[ Jl) <log 7L 



+ aE 



-(Ji) 



> 7u 



< (l-a)(log 7l ; 1 +Ci)+a(log 7 u 1 -C 2 ) 

< (l-a)(log7L 1 + Ci) 

< log 7l : 1 +Ci 



and likewise, 



Ei 



L 



(J,) 



Using the prescribed values of 7u and 7l gives 



lim 



< log7u + C 2 . 



1 + e 



n^oologS I>(Po||Pl) 

provided s < nj log n, completing the proof. 

Appendix D 
Proof of Theorem [Tol 
Proof: From the union bound on the family wise error rate, we have 

P e < {n-s)a + sf3. 

The false negative event is given as 



< 



5> (i£ fe) <7 fe ) 



fc=i 



(24) 



(25) 



We continue by bounding the above probability. The following analysis is closely related to the Chernoff-Stien 
lemma [23], but modified for one sided tests. Let %/k = (yi, ■ ■ ■ , y mk ) and define the region Ak C M mfe as 



For all y/j in Ak, by definition, 



A:= {Vk :L^\y k ) < lk }. 
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which implies 



where P™ k (y k ) — 117= 1 Pi(Vj)- Again by definition (suppressing subscript k for ease of notation) 



»i (L (mk) < 7k) = [ PT k (v)dy 

V ' JyeA 



< 



yeA 



e^P pm (y)dy 



e 7fc f P pm (y)dy 

JyeA 



< e lk . 



(26) 



The above relationship holds for any j k , though only proves meaningful if set correctly. To this end, for some 
e k > 0, let 

7 fe = -mfc(£>(P ||Pi)-e fc ) (27) 

which from d26l i gives 

Pi (^L {mk) < -yj < e ~ m '=( D ( p oll p i)- e '=). 

It remains to show what values of e k simultaneously satisfy (\3[ for any p € [1/2, 1). Specifically, we need to find 
the range values of e k that satisfy 

Po (i (mfc) < -m k (D(P \\Pi) - e k j) > p. 

Proceeding, 



» <-m fc (U(P ||Pi)-e fc )) 



\ ( — i ( " lfc) +^(Po||Pl) <£fc 



? D(P ||Pi) $>g 

I TT7 7. fc * 



Po(Y 3 ) 
m k jr'~ a P 1 {Y j ) 



< e k 



> 1 



° 2 (Po\\Pi) 
m k e 2 k 



(28) 



where the last line follows from Chebyshev's inequality. To insure the probability a null component is discarded 
on any step is greater than p, we have the following condition 



e k > 



! <r 2 (Po\\Pi) 



TOfc(l - p) 

As m k is smallest for k = 1, from the definition of m k in ( fT4t 



mp 2 n 

m k > mi > — 1 



and the condition is implied for all k provided 



^ 2 (^o||Pi) 



\ (^r-l)(l-P)' 



(29) 
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To summarize developments thus far, we've shown that if jf. = — m/j(-D(Po||-Pi) — e k) then both 



and 



for any e/. that satisfies 

Continuing with rrifc as specified in ( TT4-b 

p < EPi(^ fc) < 



7fe 



< 



k=l 

E« 

fc=i 



A' 



-m fc (D(P ||Pi)-e fc ) 



< 



E eX P 



fc=l 



— mk p 2 



n + sK 2 



D(Po\\Pi) 



* 2 (Po\\Pi) 



n+sK 2 



i ){1-P) 



V 



< 



1 - e - mc " 

where the last inequality follows as the sum is a geometric series. 



With K = 



(^) 



the false positive rate is then 

a < (l-p) K 

< (1 - p) T=7T 

5 

< 



2(n - s) 



Combining ( f30b and d3Tb gives 



< - + s - 

~ 2 1 - e - m < 



Next, from the statement of the theorem, let 



m > 



Mir) 



(30) 



(31) 



where 



c n = P 



+ sK 2 



( 



£(Po||Pi)- 



^ 2 (^o||Pi) 



^ \ \D(P \\P 1 )°n+sK^) 1 ) ( X 



Notice c < Z)(P ||Pi). Thus, m > 



logs 



D(P \\P: 

m > 



, and 

Mt) >Mf) 



r' 



(32) 
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where 



4 = P 2 



n 



+ sK 2 



D(P \\Pi) 



* 2 (Po\\Pi) 



\ G^-i)(w) 



As m > 



< 



S 5/4 



2 l-<5/(4s) 



6 



(33) 



where the last inequality holds for s > 1, S < 1 which proves the theorem. 



